Final Project Proposal

finalpart2

Initial proposal for my final project

Author

Lindsay Jones

Published

November 11, 2022

Code

library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1 
✔ readr   2.1.2      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Code

library(dplyr)

Part 1

Research Question

In the United States, wage stagnation has become a hot-button issue for workers in many fields of employment. Graduate students have been at the center of this issue in recent years: strikes for wage increases and cost-of-living adjustments have taken place at multiple universities throughout the country. Because PhD students often do not have the time to earn extra income (and their contracts often prohibit them from pursuing work elsewhere), how much they will earn from their stipend is a major factor in deciding where to pursue their research (Powell, 2004; Soar et al., 2022). My research question is: Is university ownership status (public vs. private) a predictor of the value of a PhD stipend?

Hypothesis

H₀: University ownership status is not a predictor of the value of a PhD stipend.

H₁: University ownership status is a predictor of the value of a PhD stipend.

Dataset

This dataset consists of self-reported survey data collected by PhDStipends.com. Respondents are asked for their university, department, academic year, and year in the program. They are also asked whether they receive a 12-month or 9-month salary, their gross pay, and their required fees. PhDStipends automatically calculates the LW Ratio (living wage ratio), which is the stipend divided by the living wage of the county in which the university is located.
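As a quick illustration of the LW Ratio described above, here is a minimal sketch in R. The dollar amounts are hypothetical values chosen for the example, not figures from the dataset or from the Living Wage Calculator.

```r
# Hypothetical values for illustration only; not taken from the dataset.
stipend     <- 26000  # annual gross pay in dollars
living_wage <- 23000  # assumed annual living wage for the university's county

# LW Ratio as described above: stipend divided by the county living wage.
lw_ratio <- stipend / living_wage
round(lw_ratio, 2)  # 1.13
```

A ratio above 1 means the stipend exceeds the local living wage; a ratio below 1 means it falls short.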

In addition to this information, I manually categorized universities by their ownership status as public or private, and assigned each program to one of five broader academic disciplines: Business/Policy, Social Science, Natural Science, Formal Science, and Humanities. Due to a computer issue, much of my work was lost, so the dataset is currently incomplete. The analysis that follows is based on the information I was able to recover or reenter within a reasonable period of time.
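One way the manual categorization could be made repeatable is a lookup table that maps each department name to a discipline. The sketch below uses base R; the `lookup` and `responses` data frames and the department-to-discipline assignments in them are hypothetical examples, not my actual coding scheme.

```r
# Hypothetical lookup table: one row per department, hand-assigned.
lookup <- data.frame(
  Department = c("Chemistry", "Sociology", "English"),
  Category   = c("Natural Science", "Social Science", "Humanities")
)

# Hypothetical survey responses, including one department not yet categorized.
responses <- data.frame(
  Department = c("Chemistry", "English", "Sociology", "Unlisted Program"),
  Pay        = c(30000, 22000, 24000, 20000)
)

# match() returns NA for departments missing from the lookup, so
# uncategorized programs become NA instead of a placeholder value.
responses$Category <- lookup$Category[match(responses$Department, lookup$Department)]
responses
```

Keeping the assignments in a separate table means the categorization can be rebuilt from scratch if the working file is lost again.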

The variables of interest for me are the ownership status, gross pay, program year, and academic discipline.

Code

library(readr)
csv <- read_csv("~/School/UMASS/DACSS 603/Final Project/csv.csv")

Rows: 12160 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): University, Status, Department, Category, AcYear
dbl (7): Pay, LW Ratio, ProgYear, 12 M Gross Pay, 9 M Gross Pay, 3 M Gross P...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Code

summary(csv)

  University           Status           Department          Category        
 Length:12160       Length:12160       Length:12160       Length:12160      
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
      Pay           LW Ratio        AcYear             ProgYear   
 Min.   :    1   Min.   :0.000   Length:12160       Min.   :1.00  
 1st Qu.:20000   1st Qu.:0.880   Class :character   1st Qu.:1.00  
 Median :26000   Median :1.130   Mode  :character   Median :1.00  
 Mean   :25765   Mean   :1.095                      Mean   :2.05  
 3rd Qu.:31500   3rd Qu.:1.330                      3rd Qu.:3.00  
 Max.   :96000   Max.   :4.120                      Max.   :6.00  
 NA's   :47      NA's   :422                        NA's   :1221  
 12 M Gross Pay   9 M Gross Pay   3 M Gross Pay        Fees      
 Min.   :     1   Min.   :   15   Min.   :    4   Min.   :    1  
 1st Qu.: 24000   1st Qu.:16500   1st Qu.: 3000   1st Qu.:  500  
 Median : 29000   Median :20000   Median : 5000   Median : 1000  
 Mean   : 28474   Mean   :20128   Mean   : 5194   Mean   : 2030  
 3rd Qu.: 33000   3rd Qu.:24000   3rd Qu.: 6204   3rd Qu.: 2000  
 Max.   :140000   Max.   :87467   Max.   :55816   Max.   :93725  
 NA's   :3632     NA's   :8551    NA's   :10951   NA's   :7404

Code

print(summarytools::dfSummary(csv,
                              varnumbers = FALSE,
                              plain.ascii  = FALSE,
                              style        = "grid",
                              graph.magnif = 0.70,
                              valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary: csv

Dimensions: 12160 x 12
Duplicates: 339

University [character]
  1. University of Wisconsin -        230 (1.9%)
  2. Duke University (DU)             208 (1.7%)
  3. University of North Carol        206 (1.7%)
  4. University of California         205 (1.7%)
  5. University of California,        204 (1.7%)
  6. University of Michigan -         195 (1.6%)
  7. University of Pennsylvani        193 (1.6%)
  8. University of Southern Ca        191 (1.6%)
  9. Pennsylvania State Univer        190 (1.6%)
  10. University of Minnesota -       179 (1.5%)
  [ 390 others ]                    10159 (83.5%)
  Missing: 0 (0.0%)

Status [character]
  1. Private                         4236 (34.8%)
  2. Public                          7924 (65.2%)
  Missing: 0 (0.0%)

Department [character]
  1. Chemistry                        530 (4.5%)
  2. Psychology                       391 (3.3%)
  3. Sociology                        323 (2.7%)
  4. Computer Science                 322 (2.7%)
  5. Physics                          292 (2.5%)
  6. English                          289 (2.4%)
  7. Political Science                286 (2.4%)
  8. Biology                          266 (2.2%)
  9. Economics                        197 (1.7%)
  10. Biomedical Engineering          196 (1.7%)
  [ 2916 others ]                    8747 (73.9%)
  Missing: 321 (2.6%)

Category [character]
  1. #N/A                            3310 (27.2%)
  2. 0                                625 (5.1%)
  3. Business/Policy                  211 (1.7%)
  4. Formal Science                  1658 (13.6%)
  5. Humanities                       919 (7.6%)
  6. Natural Science                 3435 (28.2%)
  7. Social Science                  2002 (16.5%)
  Missing: 0 (0.0%)

Pay [numeric]
  Mean (sd): 25765.1 (9125.4)
  min ≤ med ≤ max: 1 ≤ 26000 ≤ 96000
  IQR (CV): 11500 (0.4)
  3420 distinct values
  Missing: 47 (0.4%)

LW Ratio [numeric]
  Mean (sd): 1.1 (0.4)
  min ≤ med ≤ max: 0 ≤ 1.1 ≤ 4.1
  IQR (CV): 0.5 (0.3)
  253 distinct values
  Missing: 422 (3.5%)

AcYear [character]
  1. 2020-2021                       2657 (21.9%)
  2. 2016-2017                       1959 (16.1%)
  3. 2018-2019                       1708 (14.0%)
  4. 2019-2020                       1347 (11.1%)
  5. 2021-2022                       1194 (9.8%)
  6. 2017-2018                       1111 (9.1%)
  7. 2022-2023                        998 (8.2%)
  8. 2014-2015                        524 (4.3%)
  9. 2015-2016                        395 (3.2%)
  10. 2013-2014                        90 (0.7%)
  [ 14 others ]                       175 (1.4%)
  Missing: 2 (0.0%)

ProgYear [numeric]
  Mean (sd): 2 (1.5)
  min ≤ med ≤ max: 1 ≤ 1 ≤ 6
  IQR (CV): 2 (0.7)
  1:                                 6185 (56.5%)
  2:                                 1518 (13.9%)
  3:                                 1191 (10.9%)
  4:                                  951 (8.7%)
  5:                                  740 (6.8%)
  6:                                  354 (3.2%)
  Missing: 1221 (10.0%)

12 M Gross Pay [numeric]
  Mean (sd): 28473.9 (9013.8)
  min ≤ med ≤ max: 1 ≤ 29000 ≤ 140000
  IQR (CV): 9000 (0.3)
  1608 distinct values
  Missing: 3632 (29.9%)

9 M Gross Pay [numeric]
  Mean (sd): 20128.2 (7100.4)
  min ≤ med ≤ max: 15 ≤ 20000 ≤ 87467
  IQR (CV): 7500 (0.4)
  1046 distinct values
  Missing: 8551 (70.3%)

3 M Gross Pay [numeric]
  Mean (sd): 5194.4 (3370.8)
  min ≤ med ≤ max: 4 ≤ 5000 ≤ 55816
  IQR (CV): 3204 (0.6)
  308 distinct values
  Missing: 10951 (90.1%)

Fees [numeric]
  Mean (sd): 2030.1 (4711.9)
  min ≤ med ≤ max: 1 ≤ 1000 ≤ 93725
  IQR (CV): 1500 (2.3)
  985 distinct values
  Missing: 7404 (60.9%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-11-13
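The summary reports 339 duplicate rows. A minimal base R sketch for counting and dropping exact duplicates is shown below on a small hypothetical data frame (`toy` is made up for the example). Whether duplicates should actually be removed is a judgment call, since two respondents in the same program could legitimately report identical values.

```r
# Hypothetical data: rows 1 and 2 are exact duplicates.
toy <- data.frame(
  University = c("A", "A", "B", "C"),
  Status     = c("Public", "Public", "Private", "Public"),
  Pay        = c(20000, 20000, 25000, 30000)
)

# duplicated() flags every row that repeats an earlier row exactly.
n_dup <- sum(duplicated(toy))
n_dup  # 1

# Keep only the first occurrence of each distinct row.
deduped <- toy[!duplicated(toy), ]
nrow(deduped)  # 3
```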

Part 2

Visualizations

I’ll start with a histogram of all stipends, regardless of university ownership status.

Code

viz <- csv %>% filter(Status %in% c("Public", "Private")) 

hist(viz$Pay, breaks = 10)

The distribution appears roughly normal, with pay most frequently in the range of $20,000 to $30,000 per year.
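To back up the visual impression of normality with a number, one option is sample skewness, which is close to 0 for a symmetric distribution. The `skewness()` helper below is a hypothetical function written for this sketch (it is not from a package), and the two vectors are made-up data.

```r
# Hypothetical helper: moment-based sample skewness, ignoring missing values.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

symmetric    <- c(10, 20, 30, 40, 50)
right_skewed <- c(10, 20, 30, 40, 200)

skewness(symmetric)     # 0 for perfectly symmetric data
skewness(right_skewed)  # positive: long right tail

# On the real data this could be applied as skewness(viz$Pay), and a
# Q-Q plot drawn with qqnorm(viz$Pay); qqline(viz$Pay).
```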

Next I will generate two boxplots: one for public universities and one for private.

Code

viz %>%
  ggplot(
    aes(x=Status, y=Pay, fill=Status)) +
    geom_boxplot()

Warning: Removed 47 rows containing non-finite values (stat_boxplot).

There are quite a few outliers in both categories, but median pay is visibly higher at private universities than at public ones. Private universities also show noticeably more outliers below the first quartile than public universities do.
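The outlier comparison can also be counted rather than read off the plot. The sketch below uses `boxplot.stats()`, which applies the same 1.5 × IQR rule that boxplots use by default; the `toy` data frame and the `count_outliers()` helper are hypothetical and exist only for this example.

```r
# Hypothetical data: one low outlier for Private, one high outlier for Public.
toy <- data.frame(
  Status = rep(c("Private", "Public"), each = 6),
  Pay    = c(1, 29000, 30000, 31000, 32000, 33000,
             22000, 23000, 24000, 25000, 26000, 60000)
)

# Hypothetical helper: count outliers below and above the median.
count_outliers <- function(x) {
  x   <- x[!is.na(x)]
  out <- boxplot.stats(x)$out  # points beyond 1.5 * IQR from the hinges
  c(low = sum(out < median(x)), high = sum(out > median(x)))
}

# One column per ownership status, rows "low" and "high".
res <- sapply(split(toy$Pay, toy$Status), count_outliers)
res
```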

Hypothesis Testing

Explanatory Variable: Ownership Status (Status)
Response Variable: Gross Pay (Pay)
Control Variables: Academic Discipline (Category), Program Year (ProgYear)

First I will run a model for gross pay, using as.factor() to convert ownership status into dummy variables.

Code

# Model 1: pay regressed on ownership status only
fit1 <- lm(Pay ~ as.factor(Status), data = csv)
summary(fit1)


Call:
lm(formula = Pay ~ as.factor(Status), data = csv)

Residuals:
   Min     1Q Median     3Q    Max 
-30328  -4729    668   4918  72671 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)              30332.2      130.8  231.81   <2e-16 ***
as.factor(Status)Public  -7003.5      162.0  -43.22   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8494 on 12111 degrees of freedom
  (47 observations deleted due to missingness)
Multiple R-squared:  0.1336,    Adjusted R-squared:  0.1336 
F-statistic:  1868 on 1 and 12111 DF,  p-value: < 2.2e-16

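The coefficient for public universities is about -7,004 (p < 2e-16), so ownership status is a statistically significant predictor of pay: on average, respondents at public universities report stipends roughly $7,000 lower than those at private universities. The R-squared of 0.13 shows that status alone explains only a small share of the variation in pay. Diagnostic plots for this model appear after the final model.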
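With a single two-level factor as the predictor, the intercept is the mean of the reference group (Private, the first level alphabetically) and the slope is the difference between the two group means. A minimal sketch on hypothetical data (`toy` is made up for this example):

```r
# Hypothetical data with known group means: Private 30000, Public 23000.
toy <- data.frame(
  Status = c("Private", "Private", "Private", "Public", "Public", "Public"),
  Pay    = c(29000, 30000, 31000, 22000, 23000, 24000)
)

fit <- lm(Pay ~ as.factor(Status), data = toy)
coef(fit)
# (Intercept) is 30000, the Private mean;
# as.factor(Status)Public is -7000, the Public mean minus the Private mean.

group_means <- tapply(toy$Pay, toy$Status, mean)
group_means["Public"] - group_means["Private"]  # -7000
```

Read this way, the estimate of about -7,004 in the model above is simply the gap between the average public and average private stipend in the data.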

Next I will create a model adding the control variable “Category” (academic discipline).

Code

# Model 2: add academic discipline as a control
fit2 <- lm(Pay ~ as.factor(Status) + Category, data = csv)
summary(fit2)


Call:
lm(formula = Pay ~ as.factor(Status) + Category, data = csv)

Residuals:
   Min     1Q Median     3Q    Max 
-32473  -4293    375   4701  72535 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)              30191.0      176.8 170.757  < 2e-16 ***
as.factor(Status)Public  -7118.8      159.0 -44.770  < 2e-16 ***
Category0                  519.7      362.9   1.432 0.152228    
CategoryBusiness/Policy   2089.8      592.7   3.526 0.000424 ***
CategoryFormal Science     393.0      250.6   1.568 0.116838    
CategoryHumanities       -3447.6      310.8 -11.093  < 2e-16 ***
CategoryNatural Science   2371.5      202.7  11.697  < 2e-16 ***
CategorySocial Science   -1892.2      236.0  -8.019 1.16e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8310 on 12105 degrees of freedom
  (47 observations deleted due to missingness)
Multiple R-squared:  0.1713,    Adjusted R-squared:  0.1708 
F-statistic: 357.4 on 7 and 12105 DF,  p-value: < 2.2e-16

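Business/Policy, Humanities, Natural Science, and Social Science all appear to be statistically significant relative to the baseline level, “#N/A”; Formal Science (p = 0.117) does not. However, “Category0” is likely skewing the data, as this includes degree programs I have yet to assign to a category, and the “#N/A” baseline also appears to be a placeholder rather than a real discipline. The R-squared value here (0.171) is higher than in the previous model (0.134); however, due to the incomplete data, I will take this with a grain of salt.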
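One possible way to make the discipline coefficients easier to interpret is to treat the placeholder codes as missing and pick a real discipline as the reference level with `relevel()`. The sketch below uses a small hypothetical data frame; choosing Social Science as the reference is an arbitrary assumption for the example.

```r
# Hypothetical data containing the two placeholder codes seen in Category.
toy <- data.frame(
  Category = c("#N/A", "0", "Humanities", "Humanities", "Natural Science",
               "Natural Science", "Social Science", "Social Science"),
  Pay      = c(25000, 26000, 21000, 23000, 30000, 32000, 24000, 26000)
)

# Treat placeholders as missing so they do not act as a discipline.
toy$Category[toy$Category %in% c("#N/A", "0")] <- NA

# Make a real discipline the baseline that other levels are compared against.
toy$Category <- relevel(factor(toy$Category), ref = "Social Science")

fit <- lm(Pay ~ Category, data = toy)  # rows with NA Category are dropped
coef(fit)
# (Intercept) is 25000, the Social Science mean;
# CategoryHumanities is -3000 and CategoryNatural Science is 6000.
```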

Next I will create a model adding the control variable “ProgYear” (program year).

Code

# Model 3: add program year as a control
fit3 <- lm(Pay ~ as.factor(Status) + ProgYear, data = csv)
summary(fit3)


Call:
lm(formula = Pay ~ as.factor(Status) + ProgYear, data = csv)

Residuals:
   Min     1Q Median     3Q    Max 
-30564  -4621    469   4969  72741 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)             30719.46     177.92 172.659   <2e-16 ***
as.factor(Status)Public -7052.72     169.55 -41.597   <2e-16 ***
ProgYear                 -135.82      55.31  -2.456   0.0141 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8455 on 10896 degrees of freedom
  (1261 observations deleted due to missingness)
Multiple R-squared:  0.1374,    Adjusted R-squared:  0.1372 
F-statistic: 867.5 on 2 and 10896 DF,  p-value: < 2.2e-16

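Program year is statistically significant at the 0.05 level (p = 0.014), with pay decreasing by about $136 for each additional year in the program. R-squared (0.137) is comparable to the original model (0.134). Note that this model drops 1,261 observations with missing values, so it is fit to a smaller sample than the first two models.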

Finally, I will create a model using both control variables.

Code

# Model 4: both controls together
fit4 <- lm(Pay ~ as.factor(Status) + Category + ProgYear, data = csv)
summary(fit4)


Call:
lm(formula = Pay ~ as.factor(Status) + Category + ProgYear, data = csv)

Residuals:
   Min     1Q Median     3Q    Max 
-31798  -4232    305   4704  72735 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)             30664.55     217.30 141.118  < 2e-16 ***
as.factor(Status)Public -7168.59     166.29 -43.110  < 2e-16 ***
Category0                 701.14     378.22   1.854  0.06380 .  
CategoryBusiness/Policy  2311.60     644.93   3.584  0.00034 ***
CategoryFormal Science    414.79     260.64   1.591  0.11154    
CategoryHumanities      -3359.96     330.50 -10.166  < 2e-16 ***
CategoryNatural Science  2514.91     212.68  11.825  < 2e-16 ***
CategorySocial Science  -1869.95     247.92  -7.543 4.97e-14 ***
ProgYear                 -215.27      54.28  -3.966 7.35e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8265 on 10890 degrees of freedom
  (1261 observations deleted due to missingness)
Multiple R-squared:  0.1762,    Adjusted R-squared:  0.1756 
F-statistic: 291.1 on 8 and 10890 DF,  p-value: < 2.2e-16

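In this model, Formal Science (p = 0.112) and the unassigned “Category0” group (p = 0.064) are the only discipline terms that are not statistically significant at the 0.05 level; Business/Policy, Humanities, Natural Science, and Social Science all are.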

Code

par(mfrow= c(2,3)); plot(fit1, which=1:6)

Code

par(mfrow= c(2,3)); plot(fit2, which=1:6)

Code

par(mfrow= c(2,3)); plot(fit3, which=1:6)

Code

par(mfrow= c(2,3)); plot(fit4, which=1:6)

The large number of categorical variables in my data makes plotting any model challenging, but from what I can see the fit is not great for any model. I am curious whether a logit model would produce better results, although that would require recoding pay as a binary outcome.

Summary

I need to reevaluate some of my variables and data and see if I can come up with a way to transform the data so that the models can be improved. I may experiment with relevel() and see if that has any effect. I also have yet to try an F-test.

Also, as previously mentioned, my data is incomplete; finishing the categorization of each degree program may improve my results.
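For the F-test, one approach is to compare nested models with `anova()`. Both models need to be fit to the same rows; the models above were fit to different numbers of observations because of missing ProgYear values, so they could not be compared directly. The sketch below uses simulated data (`toy` is hypothetical, generated only to make the example runnable) and restricts both fits to complete cases first.

```r
set.seed(1)
n <- 200

# Simulated data for illustration; ProgYear includes some missing values.
toy <- data.frame(
  Status   = sample(c("Private", "Public"), n, replace = TRUE),
  ProgYear = sample(c(1:6, NA), n, replace = TRUE)
)
toy$Pay <- 30000 - 7000 * (toy$Status == "Public") + rnorm(n, sd = 8000)

# Keep only rows that are complete for every variable in the larger model.
complete <- toy[complete.cases(toy[, c("Pay", "Status", "ProgYear")]), ]

small <- lm(Pay ~ Status, data = complete)
large <- lm(Pay ~ Status + ProgYear, data = complete)

# Partial F-test: does adding ProgYear significantly reduce residual error?
ftest <- anova(small, large)
ftest
```

A small p-value in the second row of the table would suggest that the added variable improves the model beyond what ownership status explains alone.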

References

Living Wage Calculator. (n.d.). Retrieved October 10, 2022, from https://livingwage.mit.edu/

Powell, K. (2004). Stipend survival. Nature, 428, 102–103. https://doi.org/10.1038/nj6978-102a

Roberts, E., & Roberts, K. (2022, October 10). PhD stipends dataset. http://www.phdstipends.com/csv

Soar, M., Stewart, L., Nissen, S., et al. (2022). Sweat equity: Student scholarships in Aotearoa New Zealand’s universities. NZ J Educ Stud. https://doi.org/10.1007/s40841-022-00244-5